The New York Public Library is participating in the Chronicling America initiative to develop an online 
searchable database of historically significant newspaper articles. Microfilm copies of the newspapers are 
scanned and high resolution Optical Character Recognition (OCR) software is run on them. The text from 
the OCR provides a wealth of data and opinion for researchers and historians. However, categorization of 
articles provided by the OCR engine is rudimentary and a large number of the articles are labeled "edi- 
torial" without further grouping. Manually sorting articles into fine-grained categories is time consuming 
if not impossible given the size of the corpus. This paper studies techniques for automatic categorization 
of newspaper articles so as to enhance search and retrieval on the archive. We explore unsupervised (e.g. 
KMeans) and semi-supervised (e.g. constrained clustering) learning algorithms to develop article categoriza- 
tion schemes geared towards the needs of end-users. A pilot study was designed to understand whether there 
was unanimous agreement amongst patrons regarding how articles can be categorized. It was found that the 
task was very subjective and consequently automated algorithms that could deal with subjective labels were 
used. While the small scale pilot study was extremely helpful in designing machine learning algorithms, a 
much larger system needs to be developed to collect annotations from users of the archive. The "BODHI" 
system currently being developed is a step in that direction, allowing users to correct wrongly scanned OCR 
text and to provide keywords and tags for frequently used newspaper articles. On successful implementation of 
the beta version of this system, we hope that it can be integrated with existing software being developed 
for the Chronicling America project. 

Categories and Subject Descriptors: C.2.2 [Computer Science-Machine Learning]: Unsupervised Learn- 
ing 

General Terms: Machine Learning Algorithms, Subjectivity, Unsupervised Learning 

Additional Key Words and Phrases: human judgement, annotation, unsupervised learning, subjective an- 
notation, kmeans, constrained clustering 
1. INTRODUCTION 

Chronicling America (http://chroniclingamerica.loc.gov/) is a website that provides 
access to over 3 million historic newspapers scanned by the National Digital Newspaper Program 
(NDNP). It is an initiative of the National Endowment for the Humanities (NEH) and the 
Library of Congress (LC) and its goal is to develop an online, searchable database of his- 
torically significant newspapers between 1836 and 1922. State libraries, historical societies 
and universities have been funded by the NEH to generate scanned newspaper pages repre- 
senting the state's regional history, geographic coverage, and events which will be archived 
by the LC. The New York Public Library (NYPL) is part of this initiative and has scanned 
200,000 newspaper pages published between 1860 and 1920 from microfilm. 

While the images scanned by the NYPL are sent to the Library of Congress for hosting 
on the Chronicling America website, duplicate copies of the archive are also available at 
NYPL for use by its patrons. These digitized pages offer a wealth of data for researchers, 
historians, genealogists and other patrons of the library. For example, the opening of the 
Brooklyn Bridge in 1883, the construction of an immigration station at Ellis Island in 1890, 
the historic opening of the Metropolitan Opera House on Broadway and 39th Street are 
some interesting local news items of the time. Other topics widely covered by the American 
press include presidential administrations (Cleveland (1885 - 1889 and 1893 - 1897), Garfield (1881), 
McKinley (1897 - 1901), Theodore Roosevelt (1901 - 1909)), natural calamities (Galveston 
flood of 1900, San Francisco earthquake of 1906), the sinking of the Titanic, events pertaining 
to the First World War and news from the world of medicine (patented medicines, spread of 
epidemics, new discoveries). To effectively use this archive, developing sophisticated search 
and retrieval mechanisms is crucial. 

In order to make a newspaper available for searching on the Internet, the following processes 
must take place: 

(1) The microfilm copy or original paper is scanned. 

(2) Master and Web image files are generated. 

(3) Metadata is assigned for each page to improve the search capability of the newspaper. 

(4) Optical Character Recognition (OCR) software is run over high resolution images to 
create searchable full text, and 

(5) OCR text, images, and metadata are imported into a digital library software program. 

The newspaper archives can currently be searched using the OpenSearch protocol (footnote 1). Unfortunately, 
these search facilities are rudimentary and irrelevant documents are often ranked more 
highly than relevant ones. For instance, consider a search for a natural calamity like 
the April 18th, 1906 San Francisco earthquake, which killed approximately 2000 people and 
measured 7.8 on the Richter scale. If the keywords "earthquake San Francisco" are entered 
as the search criteria in the digitized newspaper archive along with the date range 01/01/1906 
to 12/31/1906, the first document returned is page 7 of the April 19th, 1906 Los Angeles 
Herald with an article "Fifty people killed at San Jose" (the word "San" is tagged); the 
second document returned is the June 3rd, 1906 issue of "The San Francisco Sunday Call" 
with a full-page illustration of a drama by Frederick Irons Bamford; and the third document 
is page 13 of the June 3rd, 1906 issue of "The Sunday Call" with a cartoon of "Major Ozone's 
fresh air crusade". The retrieval technique failed to find page 1 of the April 19th, 1906 issue of the 
newspaper "The Sun", published in New York, which carried the headline article "Earthquake 
lays Frisco in ruins". 

1 http://www.opensearch.org/Home 



Table I. The Original Text in the article versus the scanned OCR. 

Original text: 

ARMY BILL PROVISIONS. 

INCREASE AND REORGANIZATION 
RANKS MAY REACH ONE HUNDRED THOU- 
SAND IN EMERGENCIES-FILIPINOS AND 
PORTO RICANS TO BE ENLISTED. 

[BY TELEGRAPH TO THE TRIBUNE] 
Washington, Nov. 30-The approved Army In- 
crease and Reorganization bill agreed upon by 
Secretary Root and General Corbin on the one 
hand and the Senate and House Military com- 
mittees on the other is an elaborate measure of 
thirty-nine sections, providing for a detailed in- 
stead of a permanent staff; for an artillery 
corps, coast and field, instead of regiments; for 
an Army of 60,000, including officers to be in- 
creased in ranks to 100,000 in emergencies, with 
Congressional approval; 

Scanned OCR text: 

JLRMY BILL PROVISIONS 

INCREASE AND REORCAMZATICN 
HANKS MAT REACH ONE HUNDRED THOU 
SAND IN EMERGENCIES Ai FILIPINOS AND 
PORTO RICANS TO BE ENLISTED. 

IBT TCLEGSAPB TO TCS TRIBUNE] 
"Washington, Nov. Ai-he approved Army in- 
crease and Reorganization bill agreed u(^,on by 
Secretary Root and General Corbin on the one 
hand and the Senate and House Military com- 
mittees on the other is an elaborate measure of 
thirty-nine sections, providing for a detailed in 
stead of a permanent staff; for an artillery 
corps, coast and field, instead of regiments; for 
en Army of Go. j"-" including of3eers to be in 
creased in ranks to 100.000 in emergencies, with 
Congressional approval; 



On investigating the reasons why irrelevant documents are ranked higher, it was found 
that the newspapers are scanned on a page-by-page basis and article-level segmentation 
is poor or non-existent. The OCR scanning process is far from perfect and the documents 
generated from it contain a large amount of garbled text. Table I shows the text of an 
article together with the garbled text generated by the scanning process. In addition, 
categorization of article-level data using the OCR software was not very successful: most of 
the articles are labeled "editorial" and there is no fine-grained classification into categories 
such as crime, politics, medicine, etc. For example, an attempt to categorize articles in the 
edition of The Sun newspaper published on November 4th, 1894 resulted in 338 articles 
classified as editorial, 32 unclassified (footnote 2), 10 sports, 23 advertising, 5 commercial, 3 
birth-related announcements, and 2 reviews. 

Even though the New York Public Library has put substantial effort into improving the 
quality of the images and text obtained from OCR (by testing each scanned word against 
English dictionaries, manually re-typing newspaper headlines and applying categories to 
articles), the digital outputs from the OCR software are not good enough to ensure adequate 
quality of text retrieval or to meet user expectations. Consequently, we conjectured that 
a crowdsourcing project involving patrons of the library who use the archive frequently 
for research and learning could assist in the task of categorizing the articles of this huge online 
repository. 

This paper describes our experiences when setting up a pilot study on a randomly chosen 
sample of newspaper articles from the archive. Annotators were asked to browse the articles 
and come up with broad categories into which they thought the articles could be grouped; 
they were not given a pre-defined list of categories to choose from. An attempt was made to 
leverage the information provided by them (such as keywords, tags and labels) and to study 
whether popular unsupervised (footnote 3) machine learning and text mining algorithms benefit from 
such prior knowledge. In addition, annotators are evaluated based on the degree to which 
the additional knowledge they provide helps the learning task. 

This paper is organized as follows: Section 2 presents related work, Section 3 describes 
the characteristics of the data, Section 4 provides details of the pilot study and interprets 
the results, Section 5 describes algorithms for incorporating domain knowledge into learning 
tasks and the empirical analysis performed on this data, Section 6 describes the implementation 
of a system for correcting OCR text and collecting tags and Section 7 concludes the paper. 

2 These were later identified as banners of the newspaper. 

3 Since well-defined labels are not easily available from the archive, unsupervised machine learning algorithms 
were preferred to the well known supervised counterparts. 

2. RELATED PRIOR WORK 

In recent years, the humanities have seen a transformation of scholarly information from 
physical media to digital form, resulting in the formation of large digital libraries (such as 
ARTstor (footnote 4), Data-PASS (footnote 5), Biodiversity Heritage Library (footnote 6), English Broadside Ballad Archive (footnote 7), 
Great War Primary Documents Archive (footnote 8), National Archives of London (footnote 9)). Vast quantities 
of datasets, manuscripts, reports and newspapers that might never have made their way into 
a traditional library are being made accessible to the general public using high-performance 
computing, networking and storage. 

Several digital humanities projects that have used machine learning and natural language 
processing techniques to learn from historic newspaper archives are relevant to this work. 
The libraries of Richmond and Tufts have examined the Richmond Times Dispatch during 
the civil war years for more than two decades; their work focuses on automatic 
identification and analysis of full OCR text in newspapers to provide advanced searching, 
browsing and visualization [Crane and Jones 2006]. The focus of this work was on named 
entity extraction, and ten categories prominent in these newspapers were studied, including 
ship names, railroads, streets and organizations. In an earlier project at the universities, the 
Perseus project ([Smith 2002a], [Smith 2002b], [Smith and Crane 2001]), a general system to 
extract dates and names from text was developed. At the Hull Digital Library (footnote 10), information 
capture and semantic indexing of the newspaper archive is done using document analysis 
and classification [Esposito et al. 1997]. In Collection OCR, [Sankar K. et al. 2010] 
use an approximate fast nearest neighbor algorithm based on hierarchical K-Means 
(HKM) to clean OCR text. In general, the focus of almost all of the prior work in digital 
humanities has been on language modeling and has not focused on categorization of articles 
taking into consideration the subjectivity of human annotation. Finally, it must be noted that 
a preliminary version of this work using unsupervised learning algorithms was published 
in [Dutta et al. 2011] and the problem of topic evolution in the historic newspaper archive 
over time was explored by [Lee et al. 2010]. 



Recruiting web users to tag and annotate text and images in large archives has become common 
practice, especially when there are too many documents and objects for a single authority to 
label. Several projects use the collective effort of a large number of people to label 
text and images and to annotate maps. With the advent of crowdsourcing services (such as Amazon's 
Mechanical Turk (footnote 11), reCAPTCHA ([von Ahn et al. 2008], [Faymonville et al. 2009]), 
LISTEN ([Turnbull et al. 2007]), ESP ([von Ahn and Dabbish 2004]) and Games with a 
Purpose ([Ahn 2008])), paying people small sums of money to do "Human Intelligence Tasks" 
(HITs) is also becoming the norm. Such tasks include anything from labeling images to 
listening to short pieces of audio, researching topics on the internet and scrubbing database 
records. A number of recent papers have evaluated the effectiveness of using Mechanical 
Turk to create annotated data for text and natural language processing applications (see, 
for example, [Snow et al. 2008] and the papers accepted at the "Creating Speech and Language Data 
With Amazon's Mechanical Turk" workshop (footnote 12) co-located with HLT-NAACL 2010). While 
it is relatively inexpensive to obtain a large number of labels from annotators, the question 
of how to incorporate them into machine learning algorithms with theoretical guarantees 
on performance remains an open research problem. 

4 www.artstor.org 
5 http://www.icpsr.umich.edu/DATAPASS/ 
6 biodiversitylibrary.org 
7 ebba.english.ucsb.edu 
8 www.gwpda.org 
9 nationalarchives.gov.uk 
10 http://www2.hull.ac.uk/lli/libraries.aspx 
11 https://www.mturk.com/mturk/welcome 

In the context of supervised learning tasks, Smyth et al. were the first in the machine 
learning community to propose a solution to the problem of noisy labels in the context of 
labeling volcanoes in satellite images of Venus ([Smyth et al. 1994], [Burl et al. 1994], [Smyth 
et al. 1996]). They first estimate the ground truth and then use probabilistic reasoning to 
learn the classifier. Raykar and his colleagues [Raykar et al. 2009] describe a probabilistic 
approach for the case where multiple annotators provide possibly noisy labels but there is no absolute 
gold standard. Their algorithm iteratively establishes a particular gold standard, measures 
the performance of annotators given that gold standard and then refines it based on performance 
metrics. Key assumptions made by them include: (a) the performance of each annotator 
does not depend on the feature vector and (b) conditioned on the truth, the experts are 
independent. 
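
The establish, measure and refine loop described above can be illustrated with a deliberately simplified sketch. This is not the algorithm of [Raykar et al. 2009], which is a probabilistic model that also learns a classifier; it is only a weighted-vote approximation for binary labels, and the function name `estimate_consensus` and its parameters are hypothetical.

```python
import numpy as np

def estimate_consensus(labels, n_iter=10):
    """Simplified iterative consensus for binary labels (illustrative only).

    labels: array of shape (n_items, n_annotators) holding 0/1 votes.
    Returns (consensus, accuracy): the estimated 0/1 label per item and
    each annotator's agreement rate with that estimate.
    """
    labels = np.asarray(labels, dtype=float)
    n_items, n_annotators = labels.shape
    weights = np.ones(n_annotators)          # start by trusting everyone equally
    for _ in range(n_iter):
        # Establish a provisional gold standard by weighted vote.
        score = labels @ weights / weights.sum()
        consensus = (score >= 0.5).astype(float)
        # Measure each annotator against the provisional gold standard.
        accuracy = (labels == consensus[:, None]).mean(axis=0)
        # Refine: annotators who agree more often get a larger say.
        weights = np.clip(accuracy, 1e-6, None)
    return consensus.astype(int), accuracy
```

With three annotators of whom one mostly disagrees with the other two, the dissenting annotator ends up with a low agreement rate and correspondingly little influence on the consensus.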

[Sheng et al. 2008] analyzed when it is worthwhile to acquire new labels for some training 
examples. They show that repeated labeling can improve label quality, but not always; when 
labels are noisy, repeated labeling can be preferable to single labeling even in the traditional 
setting where labels are not particularly cheap. An empirical study to examine the effect of 
noisy annotations on the performance of sentiment classification models was performed by 
[Hsueh et al. 2009]. More theoretical work on when it is useful to deal with multiple experts 
can be found in [Lugosi 1992], [Dekel and Shamir 2009b], [Dekel and Shamir 2009a]. 
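
Why repeated labeling helps "but not always" can be seen from a small calculation. The sketch below assumes, purely for illustration, that every labeler is independently correct with the same probability p and that labels are combined by simple majority vote; the function name is hypothetical and the setting is much simpler than the one analyzed in the cited work.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that the majority of n independent labelers is correct,
    when each labeler is correct with probability p (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that ties cannot occur")
    # Sum the binomial probabilities of more than half the labelers being right.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
```

Under these assumptions, three labelers who are each right 70% of the time yield a majority label that is right about 78% of the time, whereas labelers who are right only 40% of the time make the majority label worse than a single label.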
Another class of algorithms that needs discussion is semi-supervised clustering algorithms. 
These algorithms fall in between totally unsupervised learning and totally supervised 
learning. The primary goal of such algorithms is to "steer" the clustering process 
with user feedback; the clusters obtained with guidance from humans also enable the users 
to play with the data and understand it intuitively. Semi-supervised clustering algorithms 
can be broadly classified into two main genres: semi-supervised clustering with (a) labels and 
(b) constraints ([Basu 2005]). 

Using labeled data, iterative feedback from users has been incorporated by [Cohn et al. 
2003], and methods for using conditional distributions in auxiliary space are reported in 
[Sinkkonen and Kaski 2002]. Seeding mechanisms for semi-supervised clustering have been 
studied by [Basu et al. 2002] and [Wagstaff et al. 2001]. For iterative clustering algorithms 
(such as K-Means), a common technique is to seed at random by arbitrarily creating K 
partitions and choosing the mean of each partition as seeds. [Forgy 1965] proposed a variant 
that chooses K instances at random as seeds, then assigns the remaining instances to the 
cluster represented by the nearest seed. [MacQueen 1967] recalculates the centroids after 
the assignment of instances to the cluster represented by the nearest seed. [Kaufman and 
Rousseeuw 1990] propose an elaborate mechanism of seed selection: the first seed is the 
instance that is most central in the data; the rest of the representatives are selected by 
choosing instances that promise to be closer to more of the remaining instances. Other 
interesting seeding mechanisms include the Buckshot method of doing hierarchical clustering 
on a sample of data to get an initial set of cluster centers ([Cutting et al. 1992]), selecting 
the k densest intervals along each co-ordinate to get the k cluster centers ([Bradley et al. 
1997]) and refining the initial seeds by taking into account the modes of the underlying 
distribution ([Bradley and Fayyad 1998]). In K-Means++ ([Arthur and Vassilvitskii 2007]), 
the random starting points are chosen with specific probabilities. By augmenting K-Means 
using this simple, randomized seeding technique, K-Means++ is Θ(log K)-competitive with 
the optimal clustering. 
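
As a rough illustration of what "chosen with specific probabilities" means, the following sketch implements the commonly described K-Means++ seeding rule, in which each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far. The function name and arguments are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng=None):
    """Pick k initial cluster centers from the rows of X using D^2 weighting."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # First seed: one data point chosen uniformly at random.
    seeds = [X[rng.integers(n)]]
    # Squared distance from every point to its nearest seed so far.
    d2 = ((X - seeds[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Points far from all current seeds are more likely to be picked.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        seeds.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.vstack(seeds)
```

The returned seeds would then be handed to an ordinary K-Means loop in place of uniformly random starting points.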



12 http://sites.google.com/site/amtworkshop2010/home 
In semi-supervised clustering with constraints, the focus is on either (a) similarity-adapting 
or (b) search-based methods. In similarity-adapting methods, an existing clustering 
algorithm using some similarity measure is employed, but the similarity measure is adapted 
to suit the problem (such as the Jensen-Shannon divergence trained with gradient descent 
[Cohn et al. 2003], the Euclidean distance modified by a shortest-path algorithm [Klein 
et al. 2002], or Mahalanobis distances adjusted by convex optimization [Bilenko and Mooney 
2003], [Xing et al. 2002]). In search-based methods, the clustering algorithm itself is modified 
so that user-provided constraints can be used to bias the search for an appropriate 
clustering. This can be done in several ways, such as by performing a transitive closure of 
the constraints and using them to initialize clusters [Basu et al. 2002], by including in the 
cost function a penalty for lack of compliance with the specified constraints [Demiriz et al. 
1999], or by requiring constraints to be satisfied during cluster assignment in the clustering 
process [Wagstaff et al. 2001]. 

Pairwise constrained semi-supervised clustering has also been studied by [Bansal et al. 
2004], [Blum et al. 2004], [Kulis et al. 2009], [Lange et al. 2005], [Ge et al. 2007], [Davidson 
et al. 2006a] and [Charikar et al. 2003], and a related approach based on Gaussian random 
field models is presented by [Zhu et al. 2003]. A probabilistic model for semi-supervised 
clustering based on Hidden Markov Random Fields was studied by [Basu et al. 2004], where 
the goal is to perform partitional semi-supervised clustering of data by minimizing an objective 
function derived from the posterior energy of the HMRF model. 
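
Two of the search-based ideas mentioned above, taking the transitive closure of must-link constraints and refusing assignments that break a constraint, can be sketched in a few lines. The code below is only an illustration in the spirit of those ideas, not a reproduction of any cited algorithm; the function names and the representation of constraints as pairs of item indices are assumptions made for this example.

```python
def must_link_closure(n, must_link):
    """Group n items into the connected components of the must-link graph."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)           # merge the two components
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def violates(item, cluster, assignment, must_link, cannot_link):
    """True if placing `item` in `cluster` breaks a constraint, given the
    partial `assignment` (a dict mapping already placed items to clusters)."""
    for a, b in must_link:
        other = b if a == item else a if b == item else None
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == item else a if b == item else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False
```

A constrained assignment step would try clusters in order of increasing distance and keep the first one for which `violates` returns False; the groups returned by `must_link_closure` could serve as initial clusters.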

[Cohn et al. 2003] note that "semi-supervised clustering assumes that human user has 
in their mind criteria that enable them to evaluate the quality of clustering". It does not 
assume that the user is conscious of what they think defines a good clustering but that, as 
with art, they will "know it when they see it". While this is very interesting, an important 
observation is that the subjectivity of labels assigned by humans is not discussed in the 
context of evaluation of semi-supervised clustering; furthermore, several open questions 
exist including incomplete seeding techniques (what if some cluster labels are not observed 
in the labeled set available) and how to deal with noise in the data. 

3. THE DATA 

Figure 1 shows a scanned newspaper page (The Sun, November 2, 1894) from the NYPL 
archive and an article from this paper. The historical newspaper archive contains two types 
of XML files: 

(1) Page-Level XMLs: For each page of a newspaper, there is an XML file that contains 
metadata about the page and the text in it. Each word scanned is stored as a string in the 
page-level XML (see Figure 2) along with possible alternative suggestions for the word 
(footnote 13). The text is extracted from the page-level XML using XPath queries and stored 
in a PostgreSQL database. 

(2) Issue-Level XMLs: The issue-level XMLs (illustrated in Figure 3) provide the following 
information about articles: (a) headlines cleaned by humans, which are of much higher 
quality than the text produced by the OCR software; (b) article segmentation information: 
each newspaper article is represented as a collection of one or more text blocks whose pixel 
coordinates are available, which helps to determine where one article ends and the next one 
begins and is particularly useful when an article spans more than one page; (c) high-level 
categorization of the articles produced by the OCR software; and (d) the date of publication, 
volume number and issue number, together with pointers to the location and names of the 
page-level XML files. 

We have access to only a subset of the NYPL archive (footnote 14): issues of The Sun newspaper from 
November 1, 1894 to December 31, 1894. 
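
The extraction of words from a page-level XML file can be sketched with Python's standard library. The sample below is a hand-written fragment modeled on the page-level XML described above (String elements carrying a CONTENT attribute and optional ALTERNATIVE children); the real files presumably carry namespaces and many more attributes, and the function name `line_text` is hypothetical. Storing the result in a database is omitted.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating one text line of a page-level XML file.
SAMPLE = """
<TextLine HEIGHT="196.0" WIDTH="2592.0" HPOS="6764.0" VPOS="2680.0">
  <String CONTENT="GRANT" WC="1.0"><ALTERNATIVE>GHANT</ALTERNATIVE></String>
  <SP WIDTH="119.0"/>
  <String CONTENT="ON" WC="1.0"/>
  <SP WIDTH="103.0"/>
  <String CONTENT="THE" WC="1.0"><ALTERNATIVE>TilE</ALTERNATIVE></String>
  <SP WIDTH="107.0"/>
  <String CONTENT="EAST" WC="1.0"/>
  <SP WIDTH="75.0"/>
  <String CONTENT="SIDE" WC="1.0"/>
</TextLine>
"""

def line_text(xml_fragment):
    """Return the primary text of a fragment and the alternatives per word."""
    root = ET.fromstring(xml_fragment)
    words, alternatives = [], {}
    # ".//String" is a limited XPath expression supported by ElementTree.
    for s in root.iterfind(".//String"):
        word = s.get("CONTENT")
        words.append(word)
        alts = [a.text for a in s.findall("ALTERNATIVE")]
        if alts:
            alternatives[word] = alts
    return " ".join(words), alternatives
```

Keeping the primary CONTENT value and ignoring the alternatives is the simplest choice here; the alternatives are returned separately so that a later correction step could consult them.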

13 We found that its primary selections are usually better than their alternatives. 
14 These have been substantially cleaned by humans. 

[Figure 1: image of the scanned newspaper page and the zoomed-in article headed "GRANT ON THE EAST SIDE."] 

Fig. 1. (Left) A newspaper page from the NYPL archive. The red border shows an article from the 
newspaper, zoomed in on the right hand figure. 

<TextBlock xmlns:ns4="http://www.w3.org/1999/xlink" ID="TB_l20101t6319r28650b51684" HEIGHT="18263" WIDTH="2823" HPOS="6639" VPOS="2544" 
    ns4:type="simple" language="en"> 
  <TextLine HEIGHT="196.0" WIDTH="2592.0" HPOS="6764.0" VPOS="2680.0"> 
    <String STYLEREFS="TSZ10" HEIGHT="192.0" WIDTH="648.0" HPOS="6764.0" VPOS="2684.0" CONTENT="GRANT" WC="1.0"> 
      <ALTERNATIVE>GHANT</ALTERNATIVE> 
    </String> 
    <SP WIDTH="119.0" HPOS="7412.0" VPOS="2684.0" /> 
    <String STYLEREFS="TSZ10" HEIGHT="188.0" WIDTH="244.0" HPOS="7532.0" VPOS="2680.0" CONTENT="ON" WC="1.0" /> 
    <SP WIDTH="103.0" HPOS="7776.0" VPOS="2680.0" /> 
    <String STYLEREFS="TSZ10" HEIGHT="188.0" WIDTH="388.0" HPOS="7880.0" VPOS="2680.0" CONTENT="THE" WC="1.0"> 
      <ALTERNATIVE>TilE</ALTERNATIVE> 
    </String> 
    <SP WIDTH="107.0" HPOS="8268.0" VPOS="2680.0" /> 
    <String STYLEREFS="TSZ10" HEIGHT="188.0" WIDTH="480.0" HPOS="8376.0" VPOS="2680.0" CONTENT="EAST" WC="1.0" /> 
    <SP WIDTH="75.0" HPOS="8856.0" VPOS="2680.0" /> 
    <String STYLEREFS="TSZ10" HEIGHT="188.0" WIDTH="424.0" HPOS="8932.0" VPOS="2684.0" CONTENT="SIDE" WC="1.0" /> 
  </TextLine> 

Fig. 2. Page Level XML file showing the article with headline "GRANT ON THE EAST SIDE". 

Figure 4 shows all the categories found by the OCR software for "The Sun" newspaper 
published between November 2nd, 1894 and December 31st, 1894. Articles in the "editorial/opinion" 
and "sports" categories contain statistically significant amounts of text, while 
reviews, illustrations and birth/death/wedding announcements are not included in our study. 

Pre-processing: We preprocess the documents to reduce dimensionality and to obtain clean 
data to learn from. For each article, a bag-of-words representation and tf-idf weights are 
obtained. Stop words such as "the", "and", etc. are removed from the set of words. Words 
of length three or less and words that contain digits or repeated characters (e.g. "paaa" 
and "ornnn") are also removed. After applying the above noise reduction techniques, the 
dimensionality of the feature space is 3210. 
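
The pre-processing steps above can be sketched as follows. Several details are assumptions made for illustration only: the stop-word list is a tiny placeholder, "repeated characters" is interpreted as the same character three times in a row (as in "paaa" and "ornnn"), and the tf-idf weight is the plain variant tf * log(N / df); the source does not specify which variants were used.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "and", "that", "with", "from", "this", "were", "have"}  # placeholder list
REPEATED = re.compile(r"(.)\1\1")   # same character three times in a row, e.g. "paaa"

def clean_tokens(text):
    """Lower-case, tokenize and drop noisy tokens as described in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens
            if len(t) > 3                              # drop words of length three or less
            and t not in STOP_WORDS                    # drop stop words
            and not any(c.isdigit() for c in t)        # drop words containing digits
            and not REPEATED.search(t)]                # drop words with repeated characters

def tfidf(docs):
    """Return one {term: weight} dictionary per document."""
    bags = [Counter(clean_tokens(d)) for d in docs]
    df = Counter(term for bag in bags for term in bag)  # document frequency per term
    n = len(docs)
    return [{t: count * math.log(n / df[t]) for t, count in bag.items()}
            for bag in bags]
```

With this weighting, a term that occurs in every document receives weight zero, which is one reason corpus-wide boilerplate contributes little to the clustering.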
<dmdSec ID="artModsBib_1_3"> 
  <mdWrap MDTYPE="MODS" LABEL="Article metadata"> 
    <xmlData> 
      <mods:mods> 
        <mods:detail type="headline"> 
          <mods:text>Grant on the East Side</mods:text> 
        </mods:detail> 
        <mods:detail type="classification"> 
          <mods:text>article/opinion-editorial</mods:text> 
        </mods:detail> 
        <mods:detail type="pageIdentifier"> 
          <mods:text>pageModsBib1</mods:text> 
        </mods:detail> 
      </mods:mods> 
    </xmlData> 
  </mdWrap> 
</dmdSec> 

Fig. 3. A segment of the issue-level XML file illustrating the OCR Classification (as "article/editorial") for 
the article and its headline. 



Category                           Article Counts 

Editorial / Opinion                11,441 
Sports                             764 
Advertising                        683 
Commercial/Legal/Public notices    361 
Birth/Death/Wedding                158 
Reviews                            45 
Illustrations                      2 
Unclassified                       785 
Total                              14,239 

Fig. 4. Top-level categories of articles from OCR software for The Sun newspaper between November 2nd, 
1894 and December 31st, 1894. 

4. CASE STUDY INVOLVING HUMAN ANNOTATORS 

We conducted a pilot study to test whether the category labeled "article/editorial" by 
the OCR software could be further broken down into more meaningful sub-categories. Six 
annotators were recruited to determine the number of natural categories found in a random 
sample of twenty-five articles. The articles (all labeled article/editorial by the OCR software) 
were selected from the November 2nd, 1894 issue of The Sun newspaper. All the annotators 
were given the same set of articles to work with. They were asked to skim the articles first 
and group them into obvious and intuitive categories, focusing on the "bigger picture". 
The defined categories had to be described in 5-10 words and preferably had to include words 
from the articles. Finally, they were interviewed with the following set of questions: 

(1) What was the strategy you used for coming up with the categories? 

(2) Were there any documents that you found difficult to assign to categories? 

(3) Did you find any part of the study particularly difficult or ambiguous? If so, describe 
the problem you faced. 

(4) How long did it take you to complete the study? 

(5) If you had the opportunity to change anything with this study, what would it be? 

While there are many other interesting research questions that can be investigated with 
human annotated data, the focus was on determining a meaningful number of sub-categories 
for the "article/editorial" category; thus reaction times and self-consistency among annotators 
were not emphasized. 

Table II. Sub-Categories found by humans in the random sample of 25 articles labeled 
"article/editorial" by the OCR software. 

ID             Number of sub-categories 

Annotator 1    8 
Annotator 2    14 
Annotator 3    13 
Annotator 4    9 
Annotator 5    10 
Annotator 6    9 

4.1. Interpreting Results from the Pilot Study 

Table II shows the number of categories found by the annotators. The November 2nd, 1894 
newspaper was published immediately after general elections; thus a lot of articles in this 
issue had to do with politics and elections. This is also reflected in the random sample 
used for the categorization task: annotators unanimously agreed that seven of the twenty-five 
articles used for the study belong to the category "politics/elections/governmental 
appointments". Three of the annotators found hierarchies within this category, such as 
"politics/ballot, politics/election, politics/nomination, politics/war, politics/social, 
politics/entertainment, politics/gossip". This accounted for the increased number of total 
categories they listed. Since the instructions explicitly mentioned focusing on the "bigger 
picture" and not drilling down to very fine-grained categories, these were merged together to 
form the category "politics". Annotators also agreed unanimously on one article belonging 
to the category "medicine, public-health, and safety". This article presented a report on a 
new diphtheria remedy and announced the arrival of fresh serum from Germany, which was 
tried on two cases in Philadelphia. They merged together articles that contained arts, 
biographies, book reviews and the like into one category called "arts/human interest". Creating 
a homogeneous category for these articles was not easy due to their wide variety. 
Articles pertaining to "death" and "marriage announcements" were binned into separate 
categories. There was no agreement among annotators on the remaining eleven articles, all of 
which had a much higher level of ambiguity; for example, an article with the headline 
"President Cleveland goes hunting for squirrels" was labeled as belonging to the following 
categories: human interest/politics/sports/entertainment/social. Since we did not have 
categories pre-defined for the annotation task and chose rather to let annotators come up 
with appropriate categories by themselves, computing agreement on these articles was not 
straightforward. 
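
One possible way to compare free-form groupings, offered here only as an illustration and not as the procedure used in the study, is to ignore the category names and ask, for every pair of articles, whether two annotators agree that the pair belongs together or apart (the Rand index). The function name and the dictionary representation of an annotator's grouping are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(groups_a, groups_b):
    """Fraction of article pairs on which two annotators agree about
    'same group' versus 'different group' (the Rand index).

    groups_a, groups_b: dicts mapping the same article ids to each
    annotator's own, free-form category name.
    """
    items = sorted(groups_a)
    pairs = list(combinations(items, 2))
    agree = sum((groups_a[i] == groups_a[j]) == (groups_b[i] == groups_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Because only co-membership is compared, an annotator who writes "politics" and another who writes "elections" for the same set of articles count as being in full agreement on those articles.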

In essence, six sub-categories for the "article/editorial" OCR category were found by 
human annotators and are listed in Table III. It must be noted that in this application 
it is hard to obtain a "ground truth" or "gold standard" which can be used for further 
labeling. Consequently, we are forced to rely on the subjective opinion of annotators who 
sometimes disagree on labels. The six categories described above are also referred to as 
inferred ground truth labels in later sections of this paper. There is considerable interest in 
the research community in whether this subjective labeling at low cost is indeed useful for 
machine learning algorithms [Raykar et al. 2009; Hsueh et al. 2009]. 

The interview section of the annotation task indicated that small or singleton categories 
lead to less agreement among humans; these outliers do not fit into a larger category easily, 
and this raised confusion and difficulty in categorization. Thus, learning from more examples 
of a similar kind was the norm. 

3 This is the value of K chosen for our experiments in later sections. 

Table III. Sub-Categories formed by humans in the random sample of 25 articles. 

ID  Category                                         Article counts 
1   politics, elections, governmental appointments   7 
2   medicine, public health and safety               1 
3   death                                            3 
4   arts, human interest, entertainment              2 
5   marriage                                         1 
6   Other                                            11 

Many of the annotators based their initial decision on the number of categories on reading 
the headlines of the articles and making notes; a feedback loop was almost always involved, 
in which annotators refined the initial estimates after a more careful and thorough reading of 
the articles. Finally, since the instructions did not clearly indicate whether an article could belong 
to multiple categories, this question was raised by several annotators. The times recorded 
by annotators indicate that it took between 45 minutes and 2 hours to categorize all 
25 articles. 



5. INCORPORATING PRIOR KNOWLEDGE INTO CLUSTERING ALGORITHMS 

The information provided by annotators, albeit subjective, provides important insights 
about how documents in the archive can be grouped together. Thus, studying whether 
this domain knowledge can be incorporated into automated document clustering algorithms 
would be beneficial. 

Clustering (or unsupervised learning) is ubiquitously used in machine learning problems 
where labels are not easily available or are difficult to generate. Given a data set X, a par- 
titional clustering algorithm groups the data into K block sets and thus provides structure 
to previously unstructured data. This is particularly useful in the context of the NYPL his- 
toric newspaper archive since the documents in that repository have no prior labels other 
than the broad categorization provided by the OCR software which can be inaccurate. A 
clustering algorithm can thus be used to generate a taxonomy amongst similar articles. 

A wide variety of clustering algorithms exist, including the K-means algorithm, hierarchical 
clustering, the expectation maximization algorithm and their variants, to name a few. 
In this study we focus on the K-means algorithm, which requires setting 
several parameters such as the number of clusters, an appropriate distance function 
and a mechanism to initialize centroids. To estimate the performance of the algorithm, 
it is common practice to compare the labels obtained from it with the "ground truth". 
However, "ground truth" may not be available and/or can be subjective. 

Semi-supervised clustering algorithms have become very popular in data mining and 
knowledge discovery [Zhu and Goldberg 2009]. These algorithms are typically used in scenarios 
where only a small amount of data with prior knowledge (as labels, constraints, 
etc.) is available in addition to a large proportion of unlabeled data. The design of a semi-supervised 
clustering algorithm depends on the mechanism by which the prior knowledge is 
incorporated. For example, it may be available as (1) pairwise constraints, implying there 
is pre-existing knowledge about whether two instances should belong to the same cluster 
(Must-link) or different clusters (Cannot-link); (2) labels associated with each instance; or 
(3) clustering constraints inferred from neighborhoods derived from labeled examples. 
The literature in semi-supervised learning, however, does not exhaustively consider scenarios 
where the labels provided are subjective. This is the focus of our research. 

In this paper, we first discuss the K-means algorithm with seeding as an example of 
how differences in parameterization of the clustering algorithm can affect the final results 
of the document clustering task. Next, we describe a different setting for the same document 
clustering task where domain knowledge comes as pairwise constraints. In either 
case, evaluating the algorithms is hard due to the subjective labels provided 
by our annotators. In the following subsections, we describe in turn the standard 
K-means algorithm, its semi-supervised counterpart obtained by careful seeding, and 
constrained K-means clustering with Must-link and Cannot-link constraints. 



5.1. The K-Means and Seeded K-Means Algorithms 

One of the oldest and most commonly used clustering algorithms is the K-means algorithm 
[Lloyd 1957; MacQueen 1967]. Assume we are given an integer $K$ and a set of $N$ 
data points $X \subset \mathbb{R}^d$; the goal is to partition $X$ into $K$ clusters, $K < N$. This can be 
achieved by choosing $K$ centroids $C = \{c_1, c_2, \ldots, c_K\}$ so as to minimize the potential function 
$\phi = \sum_{x \in X} \min_{c \in C} \mathrm{Dist}[x - c]$, where Dist represents a distance function (such as squared 
Euclidean or L1 norm). The basic steps of the algorithm are as follows: arbitrarily choose 
$K$ initial centroids $c_1, c_2, \ldots, c_K$ from $X$; for each $i \in \{1, 2, \ldots, K\}$ set the cluster $C_i$ to be 
the set of all points in $X$ that are closer to centroid $c_i$ than they are to centroid $c_j$, $\forall j \neq i$; 
for each $i \in \{1, 2, \ldots, K\}$ set the cluster centroid $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$; the last two steps are 
repeated until the process stabilizes and there are no new cluster assignments. 
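
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in our experiments: the function names, the `init` and `max_iter` parameters, and the choice to keep the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import random


def squared_euclidean(x, c):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeans(points, k, dist=squared_euclidean, init=None, seed=0, max_iter=100):
    """Lloyd-style K-means. Returns (centroids, assignment).

    points: list of tuples; init: optional list of k starting centroids
    (hypothetical parameter; random points are used when it is omitted).
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        new_assignment = [
            min(range(k), key=lambda i: dist(x, centroids[i])) for x in points
        ]
        if new_assignment == assignment:
            break  # no new cluster assignments: the process has stabilized
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster.
        for i in range(k):
            members = [x for x, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

Passing a different `dist` function (for example an L1 distance) changes the assignment step only; the mean update is kept as in the description above.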

5.1.1. Choice of Parameters. There is much debate on how to choose a suitable number of 
clusters ($K$) for a data set. For our experiments we relied on human annotators 
to come to a consensus regarding the choice of an appropriate $K$, as described in Section 4. 
The other parameter that warrants some discussion is the choice of initial seeds; we 
used two different seeding mechanisms in our experiments: (a) randomly chosen seeds, which 
do not use information about the clusters that humans produced, and (b) a semi-supervised K-Means 
algorithm called Seeded K-Means [Basu et al. 2002]. This algorithm assumes that there 
exists $S \subseteq X$, called the "seed set", on which supervision is provided by annotators; thus, 
for each $x_i \in S$ the annotator indicates which cluster it seeds, and there is at least one seed 
point $x_i$ per cluster. Once appropriate parameters have been set, the labels from K-Means 
are compared with those inferred as "ground truth" in our pilot study. Note that all articles 
on which annotators did not agree were assigned to a category called "Other". 
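
As an illustration of the seeding step, the sketch below builds one initial centroid per cluster as the mean of the seed points an annotator assigned to it, which is how Seeded K-Means is usually described. With exactly one seed per cluster the centroid is simply that seed. The function name and the dictionary-based input format are hypothetical choices for this example.

```python
def seeded_centroids(points, seed_labels, k):
    """Initial centroids from an annotated seed set.

    points: list of tuples (feature vectors).
    seed_labels: dict mapping a point index in the seed set S to a cluster
    id in range(k) (hypothetical input format).
    """
    by_cluster = {h: [] for h in range(k)}
    for idx, h in seed_labels.items():
        by_cluster[h].append(points[idx])
    centroids = []
    for h in range(k):
        members = by_cluster[h]
        if not members:
            # The algorithm assumes at least one seed point per cluster.
            raise ValueError("cluster %d has no seed point" % h)
        centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centroids
```

The returned list can then be used as the starting centroids of an ordinary K-means run.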

5.1.2. Testing the validity of clusters. In order to measure the quality of the clusters produced 
by the K-means algorithm, we first compare them to human annotated data marking each 
instance as one of the six categories listed in Table III. This procedure allows us to 
quantitatively evaluate the system. The external cluster-validity measure used in this work 
was first suggested by Dom [Dom 2001] and is equivalent to mutual information when 
cluster labels and class labels are exactly the same. Let each data set $D$ have $n$ instances 
$O_1, O_2, \ldots, O_n$ that we want to partition into $K$ clusters. Let $K = \{1, 2, \ldots, 6\}$ be the 
set of cluster labels and $C = \{1, 2, \ldots, 6\}$ be the expert annotated class labels assigned to 
the objects in $D$. Consider a two-dimensional contingency table $\mathcal{H} = \{h(c, k)\}$, where $h(c, k)$ 
represents the number of objects labeled class $c$ that are assigned to cluster $k$ by the algorithm. 
Then, if there is a perfect clustering, $\mathcal{H}$ is a square matrix with only one non-zero element 
per row/column. The marginals are defined as $h(c) = \sum_k h(c, k)$ and $h(k) = \sum_c h(c, k)$. 
Since in our experiments the number of clusters is known a priori, the cluster-validity 
measure is essentially the empirical mutual information $I(C, K) = H(C) - H(C|K)$, where 
$H(C) = -\sum_{c=1}^{|C|} \frac{h(c)}{n} \log \frac{h(c)}{n}$ and $H(C|K) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{h(c,k)}{n} \log \frac{h(c,k)}{h(k)}$. 
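
The empirical mutual information can be computed directly from the contingency counts. The following sketch assumes natural logarithms (the base is not fixed in the text) and treats empty cells as contributing zero; the function name is chosen for illustration only.

```python
import math
from collections import Counter


def mutual_information(class_labels, cluster_labels):
    """Empirical I(C, K) = H(C) - H(C|K) from two parallel label lists."""
    n = len(class_labels)
    h_ck = Counter(zip(class_labels, cluster_labels))  # contingency table h(c, k)
    h_c = Counter(class_labels)                        # marginal h(c)
    h_k = Counter(cluster_labels)                      # marginal h(k)
    # Natural log is an assumption; only non-zero cells appear in the counters.
    entropy_c = -sum((v / n) * math.log(v / n) for v in h_c.values())
    cond_entropy = -sum(
        (v / n) * math.log(v / h_k[k]) for (c, k), v in h_ck.items()
    )
    return entropy_c - cond_entropy
```

For a perfect clustering of two equally sized classes the value is $\log 2$, regardless of how the cluster ids are numbered; when the clusters carry no information about the classes the value is zero.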

In the second experiment, we questioned whether the choice of six categories was appropriate 
and instead compared the performance of seeded K-means with the standard 
K-means algorithm, setting the number of clusters as suggested by each annotator. As 
before, mutual information was recorded over multiple trials and the averaged results are 
presented in the following section. 

Table IV. Mutual Information over 10 trials using Random vs Semi-supervised Seeding techniques. 
The inferred "ground truth" sets K = 6. 

Seeding Algorithm              Mean   Std over 10 trials 
Random Sampling                0.19   ±0.10 
Semi-supervised with Seeding   0.26   ±0.07 

Table V. Average Mutual Information and standard deviation over 5 trials using Standard and Seeded K-means algorithms. The 
number of clusters is as proposed by each annotator. Ann1 through Ann6 refer to Annotators 1 through 6 who participated in the pilot study. 

                           Ann1          Ann2          Ann3          Ann4          Ann5          Ann6 
No of clusters             8             14            13            9             10            9 
Seeded K-Means Algorithm   0.235±0.056   0.131±0.048   0.097±0.058   0.214±0.107   0.130±0.064   0.210±0.089 
K-means Algorithm          0.183±0.088   0.130±0.057   0.134±0.063   0.114±0.049   0.075±0.042   0.097±0.062 



5.1.3. Empirical Evaluation. The pilot study includes 25 articles. A bag-of-words representation 
with tf-idf weights is obtained for these articles. Each article has 3210 features 
and one of six possible labels as indicated in Section 4.1. The K-Means algorithm with 
$K = 6$ is run over 10 trials using both random seeding and semi-supervised seeding. For 
semi-supervised seeding, one representative article from each category provided by human 
annotators is randomly selected as a seed, with care taken to ensure that 
all six categories are represented by at least one seed. In each trial, the labels obtained 
after clustering are tested against the inferred "ground truth" generated by annotators (as 
described in Section 4.1) and mutual information is recorded. The average and standard 
deviation of mutual information obtained over all trials are presented in Table IV. 
Seeded K-Means with semi-supervision from annotators appears more robust than the random 
seeding mechanism, since its mutual information is higher (0.26 versus 0.19) and has a lower standard 
deviation (0.07 versus 0.10) over the trials. 

In the second experiment, the standard and seeded K-means algorithms were run with 
different values of $K$ as reflected in the annotator choices. The results are shown in Table V. 
For all the annotators except annotator 3, Seeded K-means performs better than 
the standard K-means algorithm. Closer investigation revealed that annotator 3 had 
provided multiple labels for a given article; for example, s/he surmised that a particular 
article in the pilot study could belong to the category "arts/human interest/politics". Since 
the focus of our work was not on studying the impact of multiple labels, we decided to 
resolve ties by randomly selecting one label from all the suggestions; this process may have 
introduced bias and consequently affected the performance of seeded K-means. A better 
way to deal with multiple labelings would be to associate with each article a probability of belonging to a 
particular class and then use this information to guide the seeding algorithm. 

In another experiment, we used the results from the pilot study to annotate unlabeled 
articles. We applied the Seeded K-Means algorithm, with seeds suggested by annotators, to 
the remaining articles of the November 2nd, 1894 issue of The Sun newspaper that were 
not included in the pilot study. At least one representative article from each category was 
randomly selected from the clusters found by humans to create the seed, with care taken to 
ensure that all six categories were represented. We ran the Seeded K-Means algorithm ten 
times on the unlabeled articles. For each run, the number of clusters is fixed at six, the cosine 
distance metric is used to compare instances, and the same technique 
(randomly choose one of the representative documents of a category as the centroid) is used 
for generating seeds. The labels obtained from each run can be considered as produced by 
an automated annotator. Since each automated annotator only provides labels between 
1 and 6, we are able to use Krippendorff's alpha (see footnote 16) to measure inter-annotator agreement 
between them. It is seen that there is very low agreement ($\alpha = 0.316$) when 200 resamplings 

16 We used the implementation available from http://ron.artstein.org/resources/ 

Table VI. Confusion Matrix generated by two runs of Seeded K-Means on blind test data formed by articles 
of the newspaper not considered for the pilot study. Empty cells are shown as "-". 

                 Elections   Medicine   Other   Death   Human Interest   Marriage   Total 
Elections        -           1          2       -       23               -          26 
Medicine         6           7          -       3       1                4          21 
Other            1           1          4       2       2                9          22 
Death            7           -          -       13      -                1          21 
Human Interest   2           16         -       5       1                1          25 
Marriage         3           -          14      -       -                3          20 

are used for calculating two-tailed 1% confidence intervals. To illustrate this point further, we 
closely examined the labels provided by two representative automated annotators, as shown 
in the confusion matrix in Table VI. For these two automated annotators, there is 
complete agreement on sub-categories for 20.7% of the articles used for blind testing; 61.9% 
of articles labeled "Death" and 33% of articles labeled "Medicine" are correctly labeled. 
While these results are encouraging, there seems to be confusion in distinguishing between 
the "Elections" and "Human Interest" categories. It is worthwhile to note that humans also 
found it difficult to assign articles to the "human interest" category and thus this task 
appears to be significantly harder. An interesting direction for future work is to use other 
mechanisms for finding representative seeds to be used with the Seeded K-Means algorithm. 
One such approach is to identify a centroid of each human cluster by calculating the cosine 
distance between each pair of documents in the cluster, estimating the mean, and then choosing 
the document closest to the mean as the seed. 
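
The agreement percentages quoted above can be re-derived from the diagonal of the confusion matrix in Table VI. The short sketch below does this arithmetic; it assumes that empty cells in the table are zero counts and uses the "Total" column exactly as printed. The variable and function names are illustrative only.

```python
# Cells of Table VI in the order Elections, Medicine, Other, Death,
# Human Interest, Marriage. Empty cells are taken to be zero (an assumption).
CELLS = [
    [0, 1, 2, 0, 23, 0],
    [6, 7, 0, 3, 1, 4],
    [1, 1, 4, 2, 2, 9],
    [7, 0, 0, 13, 0, 1],
    [2, 16, 0, 5, 1, 1],
    [3, 0, 14, 0, 0, 3],
]
ROW_TOTALS = [26, 21, 22, 21, 25, 20]  # "Total" column as printed in the table


def overall_agreement(cells, row_totals):
    """Fraction of articles on the diagonal, i.e. given the same label twice."""
    diagonal = sum(cells[i][i] for i in range(len(cells)))
    return diagonal / sum(row_totals)


def per_category_agreement(cells, row_totals):
    """Diagonal count of each row divided by that row's printed total."""
    return [cells[i][i] / row_totals[i] for i in range(len(cells))]
```

With these inputs the diagonal holds 28 of 135 articles, and the "Death" and "Medicine" rows hold 13 of 21 and 7 of 21 respectively.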



5.2. Constrained Clustering 

The simplest form of constrained clustering uses instance-level constraints, introduced by 
Wagstaff et al. [Wagstaff et al. 2001]. There are primarily two types of constraints: must-link 
and cannot-link. A must-link constraint, written $c_{=}(a, b)$, requires a pair of points $a$ 
and $b$ to appear in the same cluster. It is an equivalence relation, hence it is symmetric, 
reflexive, and transitive. This implies that if $c_{=}(a, b)$ and $c_{=}(b, c)$, then $c_{=}(a, c)$. A cannot-link 
constraint, written $c_{\neq}(a, b)$, on the other hand, prevents the two points from being part of 
the same cluster. 

An example of the above from our study is the following: assume an annotator has classified 
articles 1, 3, 6, 7 in one category, and articles 2, 4, 5 in another category. The constrained 
clustering algorithm would create a series of must-links which might include $c_{=}(1, 3)$, which 
implies articles 1 and 3 are required to be part of the same group. It would also generate 
a set of cannot-links that may include $c_{\neq}(3, 4)$, meaning articles 3 and 4 cannot be part of 
the same group. 
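
The example can be made concrete with a small sketch that enumerates every pair of articles and turns an annotator's labels into must-link and cannot-link constraints. The function name, the dictionary input and the category names "A" and "B" are hypothetical; enumerating all pairs is a simplification for illustration.

```python
from itertools import combinations


def constraints_from_labels(labels):
    """Derive pairwise constraints from one annotator's labels.

    labels: dict mapping article id -> category (hypothetical input format).
    Returns (must, cannot), two sets of (a, b) pairs with a < b.
    """
    must, cannot = set(), set()
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            must.add((a, b))    # same category: must-link
        else:
            cannot.add((a, b))  # different categories: cannot-link
    return must, cannot


# The example from the text: articles 1, 3, 6, 7 in one category, 2, 4, 5 in another.
example = {1: "A", 3: "A", 6: "A", 7: "A", 2: "B", 4: "B", 5: "B"}
must_links, cannot_links = constraints_from_labels(example)
```

Because all pairs within a category are enumerated, the must-link set produced this way is already transitively closed. For the 25 articles of the pilot study the same enumeration yields 300 pairs in total.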

Constrained clustering can be categorized as (1) constraint-based and (2) distance-based. 
In constraint-based clustering, the algorithm utilizes constraints only after it has already 
assigned all the points to initial clusters, for example by using the K-means algorithm. It then 
checks for constraint violations and reallocates points to resolve them. 
Constraints can be enforced strictly or softly. Under strict enforcement, no instance 
can violate any constraint; the algorithm either outputs a solution or fails. Soft enforcement allows 
violations but adds penalties (see footnote 17) when required; the clustering algorithm seeks the best 
feasible assignments and always produces a solution. Distance-based constrained clustering, 
on the other hand, treats constraints as weights on the distance functions. A must-link 
"shrinks" the distance between two instances, while a cannot-link "widens" the distance 
between them. 

17 Penalties can be distance metrics or conditional probabilities. 

The empirical results presented in this paper use a constraint-based algorithm, Pairwise 
Constrained Clustering K-Means (PCKMeans) [Basu et al. 2004], with probabilistic penalties. 
The algorithm is presented in Algorithm 1. As an initialization step, it receives a list of 
constraints and generates the transitive closure of the must-links. It is important to note 
that this step makes it susceptible to noise, because a single incorrect must-link propagates 
to every point in the two groups it joins. Following this, the standard K-means algorithm 
is run. The difference between the standard K-means and PCKMeans lies in 
when and how they exploit the constraints. 



ALGORITHM 1: PCKMeans Algorithm 

input : Set of data points $X = \{x_i\}_{i=1}^{n}$, set of must-link constraints $M = \{(x_i, x_j)\}$, set 
        of cannot-link constraints $C = \{(x_i, x_j)\}$, number of clusters $k$, weight of 
        constraints $w$. 

output: Disjoint $k$ partitioning $\{X_h\}_{h=1}^{k}$ of $X$ such that objective function $J_{pckm}$ is 
        (locally) minimized. 

method 

1. Initialize clusters. 
   Create the $\lambda$ neighborhoods $\{N_p\}_{p=1}^{\lambda}$ from $M$ and $C$. 
   Sort the indices $p$ in decreasing size of $N_p$. 
   if $\lambda \geq k$ then 
      Initialize $\{\mu_h^{(0)}\}_{h=1}^{k}$ with centroids of $\{N_p\}_{p=1}^{k}$. 
   else 
      Initialize $\{\mu_h^{(0)}\}_{h=1}^{\lambda}$ with centroids of $\{N_p\}_{p=1}^{\lambda}$. 
      if $\exists$ point $x$ cannot-linked to all neighborhoods $\{N_p\}_{p=1}^{\lambda}$ then 
         Initialize $\mu_{\lambda+1}^{(0)}$ with $x$. 
      Initialize remaining clusters at random. 

2. Repeat until convergence. 
   assign cluster: Assign each data point $x$ to the cluster $h^*$ (i.e. set $X_{h^*}^{(t+1)}$), for 
      $h^* = \arg\min_h \left( \frac{1}{2} \|x - \mu_h^{(t)}\|^2 + w \sum_{(x, x_i) \in M} \mathbb{1}[h \neq l_i] + w \sum_{(x, x_i) \in C} \mathbb{1}[h = l_i] \right)$, 
      where $l_i$ is the cluster currently assigned to $x_i$. 
   estimate means: $\{\mu_h^{(t+1)}\}_{h=1}^{k} \leftarrow \left\{ \frac{1}{|X_h^{(t+1)}|} \sum_{x \in X_h^{(t+1)}} x \right\}_{h=1}^{k}$. 
   $t \leftarrow t + 1$. 
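
The assignment step of Algorithm 1 can be illustrated with the following sketch, which scores each candidate cluster by half the squared distance to its centroid plus a penalty $w$ for every must-link or cannot-link that the choice would violate. It is a simplified illustration rather than the WekaUT implementation: the data structures, the function names and the treatment of not-yet-assigned neighbors (they incur no penalty) are assumptions.

```python
def squared_distance(x, mu):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def assign_point(i, points, centroids, assignment, must, cannot, w):
    """Cluster index minimizing distance plus constraint penalties for point i.

    assignment: dict point index -> current cluster; points that are not yet
    assigned are simply absent and incur no penalty (an assumption).
    must, cannot: iterables of (a, b) index pairs.
    """
    best_h, best_cost = None, float("inf")
    for h, mu in enumerate(centroids):
        cost = 0.5 * squared_distance(points[i], mu)
        for a, b in must:
            if i in (a, b):
                other = b if a == i else a
                if other in assignment and assignment[other] != h:
                    cost += w  # must-link partner sits in a different cluster
        for a, b in cannot:
            if i in (a, b):
                other = b if a == i else a
                if other in assignment and assignment[other] == h:
                    cost += w  # cannot-link partner sits in this cluster
        if cost < best_cost:
            best_h, best_cost = h, cost
    return best_h
```

With a small weight the point stays with its nearest centroid even if that violates a constraint; with a large weight the constraint wins, which is the soft-enforcement behavior described earlier.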



5.2.1. Effectiveness of Constraints. How can we tell how useful constraints really are to 
the clustering process? This is important to consider because it is possible that the 
constraints introduced hurt performance instead of improving it [Davidson et al. 2006b]. 
The effectiveness of a set of constraints on a clustering problem can be measured by 
informativeness and coherence. 

Informativeness is a measure of how much additional information about the domain the 
constraints provide to the clustering algorithm that it was not able to determine on its own. 
It is defined as: 

$I_A(C) = \frac{1}{|C|} \left[ \sum_{c \in C} \mathrm{unsat}(c, P_A) \right]$ 

Fig. 5. First illustration of projected overlap between a must-link and a cannot-link. The must-link is the 
green line, the cannot-link is the red line, and the orange line is the projection of the cannot-link onto the 
must-link. In this example, there is no overlap between the two links. 




Fig. 6. Second illustration of projected overlap between a must-link and a cannot-link. The must-link is 
the green line, the cannot-link is the red line, and the orange line is the projection of the cannot-link onto 
the must-link. In this example, there is some overlap between the two links. 



where $C$ is the set of constraints, $A$ is an unconstrained clustering algorithm, $P_A$ is the 
partition produced by running $A$ without constraints, and $\mathrm{unsat}(c, P_A)$ is 1 if $P_A$ does not satisfy 
$c$ and 0 otherwise. 
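
Informativeness is straightforward to compute once an unconstrained partition is available. The sketch below represents each constraint as a tuple `(kind, a, b)` with `kind` equal to `"must"` or `"cannot"`; this encoding and the function names are assumptions made for the example.

```python
def unsat(constraint, assignment):
    """1 if the partition violates the constraint, 0 otherwise."""
    kind, a, b = constraint
    same_cluster = assignment[a] == assignment[b]
    # A must-link is satisfied when both points share a cluster,
    # a cannot-link when they do not.
    satisfied = (kind == "must") == same_cluster
    return 0 if satisfied else 1


def informativeness(constraints, assignment):
    """Fraction of constraints violated by the unconstrained partition."""
    if not constraints:
        return 0.0  # assumption: an empty constraint set carries no information
    return sum(unsat(c, assignment) for c in constraints) / len(constraints)
```

A value near 1 means the unconstrained algorithm gets most of the annotator's pairs wrong, so the constraints tell it something new; a value near 0 means they largely restate what it already finds.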

Coherence tries to determine how "contradictory" the information is. It quantifies the 
level of agreement between constraints in set $C$, using a distance metric $D$. It is measured 
by the projected overlap between two constraints, i.e. how much overlap there is when one 
constraint is projected along the direction of the other. 

Let $a$ be a cannot-link and $b$ a must-link; then the projection is estimated as follows: 

$p = \mathrm{proj}_b\, a = (|a| \cos\theta) \frac{b}{|b|}$ 

where $\theta$ is the angle between the two vectors. To calculate how much of the projection of $a$ 
overlaps with $b$, we compute the distance corresponding to the three cases: 

Fig. 7. Third illustration of projected overlap between a must-link and a cannot-link. The must-link is the 
green line, the cannot-link is the red line, and the orange line is the projection of the cannot-link onto the 
must-link. In this example, there is complete overlap between the two links. 



[Equation: piecewise definition of $\mathrm{overlap}_b(a)$ in three cases, distinguished by comparing the distances $d_{b_2,p_2}$ and $d_{b_2,p_1}$ with $d_{b_2,b_1}$.] 

where $b_1, b_2, p_1, p_2$ are the beginning and end coordinates of vector $b$ and of the projection 
of $a$ given by $p$. Figures 5, 6 and 7 provide examples of the projection scheme. There are 
two ways in which constraints can have zero projected overlap: (1) they are orthogonal to 
each other, so that neither link interferes with the other, or (2) they are both the same type 
of link (both are must-links or both are cannot-links), so any overlap that exists does not 
matter. The coherence (COH) of a constraint set $C$ using a distance metric $D$ is then defined 
as the fraction of ML-CL (see footnote 18) constraint pairs in the constraint set $C$ that have zero projected 
overlap, i.e. 



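$\mathrm{COH}_D(C) = \frac{\sum_{m \in C_{ML},\, c \in C_{CL}} \delta(\mathrm{overlap}_c(m) = 0 \text{ and } \mathrm{overlap}_m(c) = 0)}{|C_{ML}|\,|C_{CL}|}$    (1) 

where $C_{ML}$ and $C_{CL}$ represent the sets of must-link and cannot-link constraints respectively 
in $C$, $|C_{ML}|$ and $|C_{CL}|$ represent the cardinality of each set, and $\delta(\cdot)$ equals 1 when its 
argument holds and 0 otherwise. A clustering algorithm, when 
given a constraint set with low coherence, gets confused about how to properly label points. 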

Davidson et al. [Davidson et al. 2006b] indicate that constraint sets with high informativeness 
and high coherence improve performance, while sets with low informativeness and 
low coherence hurt performance. Low informativeness means the constraint set does not 
provide much helpful information. Low coherence means the information provided by the 
constraint set is contradictory and confusing. This is further illustrated in Figures 8, 9, 10 
and 11. 
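
To give a feel for the coherence measure, the sketch below computes it for constraints represented as line segments between two points. The overlap function is one plausible geometric reading of "projected overlap" (the length of one segment covered by the orthogonal projection of the other onto its line), not necessarily the exact case analysis used in [Davidson et al. 2006b]. All names, the segment representation and the tolerance `eps` are assumptions.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def projected_overlap(a, b):
    """Length of segment b covered by the projection of segment a onto b's line.

    a and b are pairs of points, e.g. ((x1, y1), (x2, y2)).
    """
    b1, b2 = b
    direction = _sub(b2, b1)
    length_sq = _dot(direction, direction)
    if length_sq == 0:
        return 0.0
    # Position of each endpoint of a along b, where b spans the interval [0, 1].
    t = [_dot(_sub(p, b1), direction) / length_sq for p in a]
    lo, hi = max(min(t), 0.0), min(max(t), 1.0)
    return max(0.0, hi - lo) * math.sqrt(length_sq)


def coherence(must_links, cannot_links, eps=1e-12):
    """Fraction of ML-CL pairs with zero projected overlap in both directions.

    Both lists are assumed to be non-empty.
    """
    pairs = [(m, c) for m in must_links for c in cannot_links]
    zero = sum(
        1
        for m, c in pairs
        if projected_overlap(m, c) <= eps and projected_overlap(c, m) <= eps
    )
    return zero / len(pairs)
```

Under this reading, a must-link and a cannot-link that are orthogonal contribute to coherence, while two parallel links lying alongside each other do not.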

5.2.2. Modeling Annotators and Assessing their Quality. Online digital archives such as the 
NYPL historic newspaper archive are unlabeled, and getting a dataset labeled by annotators 
can be time consuming, if not impossible. With the advent of crowdsourcing, getting cheap 
labels is relatively easy. How good are these labels? People with different levels of expertise 
(novices, scholars, biased and malicious annotators) may provide inexpensive labels but 


18 ML: Must-Link, CL: Cannot-Link 

Fig. 8. An example of a constraint set with high informativeness. The points represent articles. The black 
dots represent one cluster, and the blue dots represent a second cluster. This is typical of how a simple 
KMeans algorithm would classify points based on distance. The red edges are cannot-link constraints, and 
the green edges are must-link constraints. This constraint set indicates that many of the points close together 
do not belong in the same set, while the two points that the must-link connects belong in the same set. A 
simple KMeans algorithm would not partition the dataset in this way, meaning the constraints provide very 
valuable information. 







Fig. 9. An example of a constraint set with low informativeness. The points represent articles. The black 
dots represent one cluster, and the blue dots represent a second cluster. This is typical of how a simple 
KMeans algorithm would classify points based on distance. The red edges are cannot-link constraints, and 
the green edges are must-link constraints. This constraint set does not provide any additional information 
that conflicts with how the simple KMeans algorithm would classify these points. All the links between a 
point in the first cluster and a point in the second cluster are cannot-links, while all the links connecting 
two points within the same cluster are must-links. 

extracting meaningful information from them could be challenging. Assessing the "quality" of 
labels is therefore of interest. 

In the PCKMeans algorithm, the constraints can be used to encapsulate the prior knowledge 
an annotator has. The informativeness of constraints then provides a qualitative measure 
of how good the suggested constraints are; in other words, informativeness measures 
how much additional information about the domain these constraints provide, obtained by 
comparing the same clustering algorithm with and without constraints (see footnote 19). An annotator 



19 It should be noted that for the comparison to be meaningful, the same set of initial parameters for 
the unconstrained clustering algorithm should be used; for example if using the KMeans algorithm, the 

Fig. 10. An example of a constraint set with high coherence. The points represent articles. The red edges 
are cannot-link constraints, and the green edges are must-link constraints. The points in the region near the 
two points connected by the constraint should have the same type of link. In this figure, all the must-links 
connect to points very near each other and do not show any contradictions. 




Fig. 11. An example of a constraint set with low coherence. The points represent articles. The red edges 
are cannot-link constraints, and the green edges are must-link constraints. The links connecting the points 
in the upper right-hand corner with other points in the figure are a mixture of must-links and cannot-links 
that seem to contradict each other. This constraint set can confuse the PCKMeans algorithm. 



with more "informative" constraints is clearly preferred over one whose constraints do not 
provide additional information about the domain. 

For our experiments, we modeled each annotator by the PCKMeans clustering algorithm: 
different sets of constraints were generated from the labels provided by the annotator, and the 
performance of the KMeans algorithm with and without constraints was measured using 
informativeness. 

5.2.3. Empirical Evaluation. For our experiments, we used the WekaUT extension for Weka 
from http://www.cs.utexas.edu/users/ml/risc/code/. 

Experiment 1: Replicating annotator performance by PCKMeans 

In the first experiment, we wanted to determine if PCKMeans could accurately replicate 
the performance of annotators and classify the twenty-five articles the way each annotator 



unconstrained algorithm should be trained each time with the same initial seeds, distance metric and the 
number of clusters. 

Fig. 12. Average Mutual Information vs. Constraints. The mutual information compares each annotator's 
PCKMeans results with his/her own labels across an increasing number of constraints. As expected, as the 
number of constraints increased, the PCKMeans results for each annotator became more in line with his/her 
labelings. 



did. We ran five trials of the clustering algorithm on the data for each annotator for 
different numbers of constraints, varying them from 10 to 300 in increments of 10. The 
constraints were generated from the labels provided by the annotators as follows: randomly 
sample two instances; if they were assigned the same class label by the annotator, there is a 
MUST link between them; if they were assigned different class labels, there is a CANNOT 
link between them. Note that $\binom{25}{2} = 300$ is the maximum number of constraints that 
can be generated from the pilot study data. The mutual information (as described in 
Section 5.1.2) between each annotator's PCKMeans clustering results and his/her own labels 
is estimated. The average mutual information over the five trials is shown in Figure 12. 
It appears that PCKMeans can accurately replicate the annotators only when supplied 
with a sufficiently large number of constraints. The mutual information reached one 
only after 200 constraints were supplied to the algorithm. In the second experiment, the 
mutual information between each annotator's PCKMeans results and the labels inferred from 
the pilot study was computed. The results are illustrated in Figure 13. It was not expected 
that each annotator's labels would converge to 1.0, and indeed some annotators such as 
Annotators 2, 3 and 5 have low average mutual information. However, the informativeness 
improved as expected. Another interesting observation was that the annotators who supplied the 
largest numbers of categories were the ones with the lowest mutual information scores, 
since they were providing far more labels than were inferred as ground truth. 



Experiment 2: Comparing Annotators 

In the next experiment, we had six PCKMeans clusterers, one trained for each annotator; 
we compared the output of each PCKMeans clusterer with the results of running standard 
KMeans (Figure 14) for inferred ground truth labels. This enabled us to measure the 



3 Convergence to one meant that the annotator's label exactly matched the inferred ground truth label. 

Fig. 13. Average Mutual Information vs. Constraints. Here, the mutual information compares each annotator's 
PCKMeans results with the inferred ground truth labels. We did not expect the values to converge 
to one as the number of constraints increased, but we did expect the informativeness to improve. The graph 
indicates that the level of informativeness depended on the difference between the number of clusters each 
annotator assigned and the number in the inferred ground truth. 

Fig. 14. Average Informativeness vs. Constraints. Here, the informativeness is measured by taking each 
annotator's PCKMeans results over the results of the standard KMeans for the inferred ground truth data. 

Fig. 15. Average Informativeness vs. Constraints. Here, the informativeness is measured by taking each 
annotator's PCKMeans results over the results of PCKMeans for the inferred ground truth data. 



informativeness of constraints for each PCKMeans clusterer and thereby provide a way 
to compare annotators as described in Section 5.2.2. As before, the number of constraints 
was varied from 10 to 300 in increments of 10. Each colored line in Figure 14 refers to the 
average performance of an annotator over five different trials. It is interesting to see that 
all of the annotators seem to have a high informativeness when the number of constraints is 
below 60. When the number increases beyond 60, the informativeness becomes more or less 
constant; in other words, arbitrarily increasing the number of constraints does not provide 
more prior knowledge with respect to the unconstrained problem. It is also interesting to 
note that annotators 2 and 3 maintained a high informativeness compared to all the other 
annotators. A closer look at their annotations revealed that they had provided the largest 
number of categories in the pilot study; not only were they thinking hierarchically when 
providing the categories, they also seemed much more detail-oriented in their approach to 
providing annotations. 



Experiment 3: Studying the impact of providing (and inferring) different num- 
ber of categories 

Another experiment that we conducted was to compare the output of each 
PCKMeans clusterer with the results of running PCKMeans (Figure 15) for inferred ground 
truth labels. This allowed us to study the impact of an annotator suggesting a different 
number of classes than those inferred from the pilot study. For example, is a PCKMeans 
clusterer with 10 constraints generated from 13 class labels suggested by an annotator 
necessarily better (in terms of informativeness) than a PCKMeans clusterer with 
10 constraints generated from the 6 class labels inferred from the pilot study? In this case, 
annotator 3 still consistently had a higher informativeness than the others, but surprisingly 
annotator 2, who provided the maximum number of classes in the pilot study, did not have 
an overall high informativeness. This could be attributed to the fact that the constraints 
are generated randomly; it would be much more useful to study the case where the 
pairwise constraints are provided manually by the annotators. 

Experiment 4: Studying Coherence 



Figures 16 and 17 compare the informativeness against the coherence of the clusters 
generated in the two cases described above. Each dot in the figure represents the 
informativeness vs. coherence value for a given number of constraints. 

Fig. 16. Average Informativeness vs. Coherence. Here, the informativeness is measured by taking each 
annotator's PCKMeans results over the results of the standard KMeans for the inferred ground truth data. 



If we divide the graphs into four regions, each region indicates a different set of 
characteristics of the constraint set. The upper right region, the area with high 
informativeness and high coherence, indicates that the annotator's labelings are extremely 
helpful to the clustering algorithm. This is the ideal region, where the algorithm is very 
likely to produce good results. The lower left region indicates low informativeness and 
low coherence, meaning that the classifications the annotator provides are not only 
unhelpful but also contradictory and confusing to the clusterer. This is the worst region, 
and results produced by the PCKMeans algorithm here will mostly be poor. The other two 
regions are of mixed usefulness. The upper left region means the annotator provides a lot 
of information, but it is confusing, while the lower right region means the annotator 
provides clear but very little information. As Figure 17 shows, Annotator 3's datapoints 
fall closest to the ideal region, while Annotator 6's points fall in the worst region. 
This indicates that Annotator 3 provides better and clearer information to the clusterer. 
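
The four-region reading of these plots can be written down as a small Python sketch. The paper does not say where the graphs are divided, so the cut points `info_cut` and `coh_cut` (and the function name `classify_region`) are assumptions chosen purely for illustration; coherence is taken as the horizontal axis and informativeness as the vertical axis, which matches the region descriptions above.

```python
def classify_region(informativeness, coherence, info_cut=0.5, coh_cut=0.5):
    """Map one (informativeness, coherence) point to a plot region.

    The cut points are hypothetical; the paper does not give numeric
    boundaries for the four regions.
    """
    high_info = informativeness >= info_cut
    high_coh = coherence >= coh_cut
    if high_info and high_coh:
        return "upper right: ideal"
    if high_info:
        return "upper left: informative but confusing"
    if high_coh:
        return "lower right: clear but uninformative"
    return "lower left: worst"


if __name__ == "__main__":
    # Illustrative points only, not values from the study.
    for point in [(0.9, 0.8), (0.9, 0.1), (0.1, 0.8), (0.1, 0.1)]:
        print(point, classify_region(*point))
```

In practice the cut points would have to be chosen relative to the range of values actually observed across annotators.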



The reader is reminded that the number of constraints is varied from 10 to 300 in increments of 10. 

Fig. 17. Average Informativeness vs. Coherence. Here, the informativeness is measured by taking each 
annotator's PCKMeans results over the results of PCKMeans for the inferred ground truth data. 



In addition, it is helpful to look at how spread out the datapoints are. A tight clustering 
in one region means the characteristics of the annotator's labels are consistent across a 
varying number of constraints. A wide and loose scattering means the labelings are not 
consistent. Comparing Figures 16 and 17, one sees that the datapoints for all the annotators 
are clustered much more tightly when compared to the PCKMeans algorithm than when 
compared to simple KMeans. This indicates that the PCKMeans ground truth is more 
consistent with what the annotators have in mind than the simple KMeans. 

6. THE BODHI SYSTEM: LARGE SCALE OCR CORRECTION AND TAG COLLECTION AT 
THE NYPL 

To enable collaborative tagging, we are designing a system (named "BODHI") that can 
be integrated with the current architecture used in the National Digital Newspaper Program 
(NDNP). It will allow patrons to correct garbled OCR text, enter keywords for tagging 
articles, provide useful information about the segmentation of an article (for example, 
whether the article is continued onto another page), add links to other related articles, 
etc. This user-provided metadata will augment the scanned text and image data obtained 
from the OCR scanning process. While the NDNP Content Management System has many 
modules to manage the newspaper digitization workflow from scanning microfilm to public 
delivery, the BODHI system focuses only on improving search and retrieval. 

Figure 18 presents the architecture and digital technology that will be used in the 
development of the "BODHI" system. The file server stores the digitized newspaper images 
as jpeg, tiff, pdf or xml files (these are obtained from the NYPL after the OCR scanning 
process). These files are subjected to Extract, Transform and Load (ETL) operations using 
software developed in Java and XML, and the results are stored in a PostgreSQL database. 
In addition to storing OCR data from the newspaper articles, the database is also capable 
of storing user registration information and of version and change tracking as required 
(assuming the patrons of the archive may edit an article multiple times and add different 
annotations). It is indexed using Apache Lucene 
(http://lucene.apache.org/java/docs/index.html), which has an open-source Java-based 
indexing and search implementation as well as spellchecking, hit highlighting and advanced 
analysis/tokenization capabilities. The components of the BODHI application include a user 
manager, article display, OCR corrector, tag (or annotation) collector and search and 
retrieval manager. The prototype has been developed using Ruby-on-Rails and will be 
deployed using Apache with Phusion Passenger (http://www.modrails.com/). Once deployed 
on the library infrastructure, the web interface can be accessed from social media, mobile 
apps and the internet. 
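
The version and change tracking described above can be sketched as follows. This is not the BODHI schema: the table and column names are hypothetical, and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example is self-contained. It only illustrates the idea that each patron edit is stored as a new version rather than overwriting the raw OCR text.

```python
import sqlite3

# Hypothetical schema: raw OCR text plus an append-only correction history.
SCHEMA = """
CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    ocr_text TEXT NOT NULL
);
CREATE TABLE corrections (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    article_id INTEGER NOT NULL REFERENCES articles(id),
    user_email TEXT NOT NULL,
    version INTEGER NOT NULL,
    corrected_text TEXT NOT NULL,
    UNIQUE (article_id, version)
);
"""


def open_store():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_correction(conn, article_id, user_email, text):
    """Store a patron's edit as the next version and return its number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM corrections WHERE article_id = ?",
        (article_id,),
    ).fetchone()
    version = row[0] + 1
    conn.execute(
        "INSERT INTO corrections (article_id, user_email, version, corrected_text)"
        " VALUES (?, ?, ?, ?)",
        (article_id, user_email, version, text),
    )
    conn.commit()
    return version


def current_text(conn, article_id):
    """Return the latest corrected text, falling back to the raw OCR text."""
    row = conn.execute(
        "SELECT corrected_text FROM corrections WHERE article_id = ?"
        " ORDER BY version DESC LIMIT 1",
        (article_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    row = conn.execute(
        "SELECT ocr_text FROM articles WHERE id = ?", (article_id,)
    ).fetchone()
    return row[0] if row is not None else None
```

Keeping the original OCR text untouched means any patron edit can be reviewed or rolled back, which fits the assumption that an article may be edited multiple times.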



7. CONCLUSION AND FUTURE WORK 

The New York Public Library has an archive of over 200,000 historical newspapers published 
between 1890 and 1920 which have been subjected to OCR and are currently stored in an 
online database, making them accessible to patrons. Unfortunately, search facilities on this 
database are rudimentary: newspapers are scanned on a page-by-page basis, article level 
segmentation is almost non-existent, and the OCR scanning process introduces a lot of 
garbled text. In a bid to make these archives more accessible to the general public, text 
mining algorithms are being considered for categorization of articles. The OCR software 
provides a rough categorization, but a large proportion of the articles are labeled 
"article/editorial" without division into further meaningful categories. Thus, articles 
dealing with medicine and crime are deemed to belong to the same category, which makes 
search and retrieval of articles difficult. We designed a pilot study to observe whether 
humans were able to find coherent categories in a small subset of articles. Our results 
indicate that the presence of small and noisy clusters in the data made it difficult to 
reach agreement on the number of clusters. We also evaluated the quality of the annotations 
provided by humans by measuring how much additional information they could provide to help 
the clustering algorithm. More specifically, the informativeness of constraints in a 
constrained clustering algorithm was used as a metric to evaluate how well they labeled 
articles. Our results from the pilot study are very encouraging and we are developing a 
large scale system (christened "BODHI") in collaboration with the New York Public Library 
to collect tags and incorporate the "wisdom of crowds" into machine learning algorithms. 
Future work involves development of more sophisticated non-parametric and Bayesian text 
mining algorithms and experiments on a large scale using the deployed "BODHI" system. 

A. USER INTERFACE DESIGN OF THE BODHI SYSTEM 

A prototype of the user interface for the BODHI system, which allows correction of OCR 
text and collection of tags from patrons, was developed. The article manipulation module 
is capable of retrieving an article from the database based on a search criterion, 
displaying the scanned OCR text alongside a high resolution image, highlighting sections 
of it when clicked, allowing a user to edit the OCR text after checking the content in the 
high resolution image, and storing the corrected text back into the database. Tags and 
comments can be added on an article-by-article basis. The module also has a user 
authentication mechanism for patrons registered to correct OCR or add notes and tags. 

Figure 19 in the Appendix presents screen-shots from the prototype. The application 
utilizes Rails 3.1 (on Ruby 1.9.2), a Model-View-Controller framework, to map rows from the 
PostgreSQL database into objects that may be utilized by the client. The viewer extends 
the functionality of a jQuery plugin called ImgAreaSelect 
(http://odyniec.net/projects/imgareaselect/examples.html), an open source independent 
project by Michael Wojciechowski. This plugin provides useful functions for manipulating 
an image, such as selection of a specific area, customization of the behavior of the 
selection box (the dotted box shown in Figure 19) and functions that expose key events 
(such as moving the selection box). In our application, the ImgAreaSelect plugin is 
utilized by loading two images, one in low resolution and the other in high resolution. 
The low resolution image allows a user to select the part of the picture s/he wants 
enlarged. The high resolution image is displayed by the "Viewer". Javascript code takes 
the selection on the low resolution image, scales it to the high resolution image and 
finally displays it in a box that overlaps the original image. 
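
The scaling step described above amounts to multiplying the selection coordinates by the ratio of the two image sizes. The prototype does this in Javascript; the Python sketch below only illustrates the arithmetic, and the function name `scale_selection` and the (x1, y1, x2, y2) tuple layout are assumptions rather than the prototype's actual code.

```python
def scale_selection(selection, low_size, high_size):
    """Map a selection box from the low resolution image to the high one.

    selection: (x1, y1, x2, y2) in low resolution pixels.
    low_size, high_size: (width, height) of the two images.
    """
    x1, y1, x2, y2 = selection
    low_w, low_h = low_size
    high_w, high_h = high_size
    sx = high_w / low_w  # horizontal scale factor
    sy = high_h / low_h  # vertical scale factor
    return (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))


if __name__ == "__main__":
    # A 400x600 preview of a 1600x2400 scan gives a scale factor of 4.
    print(scale_selection((10, 20, 110, 70), (400, 600), (1600, 2400)))
```

Using separate horizontal and vertical factors keeps the mapping correct even if the preview was not scaled uniformly.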

The backend for the article manipulation module stores precise x and y coordinates for 
each word. The Rails application is capable of retrieving this data and using the 
ImgAreaSelect plugin to pinpoint a word that is clicked on both the low resolution preview 
picture and the magnified high resolution picture. The highlighting of a word is done by a 
simple jQuery call that manipulates the Cascading Style Sheet (CSS) attributes of the 
Document Object Model (DOM) element in the HTML. 

Fig. 19. The OCR correction module of the BODHI system, illustrating features to highlight text in articles, 
display high-resolution images from the database, and add tags and comments from patrons. 
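
The word lookup described above (finding which stored word a click falls on) can be sketched in Python as a simple point-in-box test. The paper only states that x and y coordinates are stored for each word; the `width` and `height` fields, the `WordBox` type and the `word_at` function are assumptions introduced for this illustration.

```python
from typing import NamedTuple, Optional, Sequence


class WordBox(NamedTuple):
    """One OCR word with its position (width/height are assumed fields)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def word_at(boxes: Sequence[WordBox], px: int, py: int) -> Optional[WordBox]:
    """Return the first word whose box contains the clicked point, if any."""
    for box in boxes:
        inside_x = box.x <= px < box.x + box.width
        inside_y = box.y <= py < box.y + box.height
        if inside_x and inside_y:
            return box
    return None


if __name__ == "__main__":
    # Illustrative boxes only, not data from the archive.
    boxes = [WordBox("NEW", 10, 10, 40, 12), WordBox("YORK", 55, 10, 50, 12)]
    print(word_at(boxes, 60, 15))
```

A click on the low resolution preview would first have its coordinates scaled to the resolution in which the word boxes were recorded.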



A basic login system is implemented using Devise (https://github.com/plataformatec/devise), 
an open source plugin ("gem") available to Rails. Each user will be required to create an 
account by specifying an email and password, and to use these credentials when correcting 
a newspaper article or creating tags. 